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The reverberation time (RT) is an important parameter for characterizing the 
quality of an auditory space. Sounds in reverberant environments are subject to 
coloration- This affects speech intelligibility and sound localization. Many state- 
of-the-art audio signal processing algorithms, for example in hearing aids and tele- 
phony are expected to have the ability to evaluate the characteristics of the listening 
environment and turn on an appropriate processing strategy accordingly. Thus, a 
method for characterization of room RT based on passively received microphone 
signals represents an important enabling technology. Current RT estimators, such 
as Schroeder f s method or regression, depend on a controlled sound source, and 
thus cannot produce an online, blind RT estimate. Here, we describe a method for 
estimating RT without prior knowledge of sound sources or room geometry. The 
diffusive tail of reverberation was modeled as an exponentially damped Gaussian 
white noise process. The time constant of the decay, which provided a measure 
of the RT, was estimated using a maximum-likelihood procedure. The estimates 
were obtained continuously, and an order-statistics filter was used to extract the 
most likely RT from the accumulated estimates. The procedure was illustrated for 
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connected speech. Results obtained for simulated and real room data are in good 
agreement with the real RT values. 

L INTRODUCTION 

The estimation of room reverberation time (KT) has existed in the field of engineering 
and architectural acoustics for nearly a century. The RT of a room specifies the duration 
for which a sound persists after it has been switched off. The persistence of sound is due 
to the multiple reflections of sound from the various surfaces within the room. RT is com- 
monly referred to as the time since the accepted measure of persistence of sound is 
the time to decay 60 dB below its peak value. Since reverberation results tn temporal and 
spectral smearing of the sound pattern, thus distorting both envelope and fine structure, 
the RT of a room provides a measure of the listening quality of the room. This is of par- 
ticular importance in speech perception where it has been noted that speech intelligibility 
reduces as the RT increases, due to masking within and across phonemes ([1, 5, 1 1]). State- 
of-the-art hearing aids, or other audio processing instruments, which implement signal 
processing strategies tailored to specific listening environments, are expected to have the 
ability to evaluate the characteristics of the environment, and turn on the most appro- 
priate strategy. Thus, a method that can characterize the RT of a room from passively 
received microphone signals represents an important enabling technology. 

In the early 20th century, Sabine ([17]) provided an empirical formula for the explicit 
determination of RT based solely on the geometry of the environment (volume and sur- 
face area) and the absorptive characteristics of its surfaces. Since then, Sabine's reverber- 
ation time equation has been extensively modified and its accuracy improved (see [7] for 
a historical review of the modifications), so that today it finds use in a number of com- 
mercial software packages for the acoustic design of interiors. Formulae for calculation 
of RT are used primarily for the design of concert halls, classrooms, and other acoustic 
spaces where the quality of the received sound is of greatest importance, and the extent 
of reverberation has to be controlled. However, in order to determine the RT of exist- 
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ing environments, both the geometry and the absorptive characteristics have to be first 
determined, When these cannot be easily determined, it is necessary to search for other 
methods, such as those based purely on controlled recordings of sounds in the environ- 
ment to be tested. 

Schroedcr ([18, 19]) presented a method for estimating RT based solely on the record- 
ing of an acoustic signal, radiated into the test enclosure. The method obviates some of 
the problems occurring in situations where the geometry and surface absorption charac- 
teristics are unknown. A burst of broad- or narrow-band noise is radiated into the test 
enclosure until it reaches steady state, and it is then abruptly switched off. The method 
tracks the sound energy decay following sound cessation ([18]). This method, referred 
to as Schroeder *s backward integration method, while theoretically and practically im- 
portant, has some limitations. Specifically the sound used for measuring RT must be 
stationary and uncorrected, and the precise time of sound offset must be known- 
While Schroeder s method has been improved over the years (see [2, 22] for example), 
the improvements do not lift the restrictions placed on the applicability of Schroeder 's 
method. At present, a "blind H method that requires no knowledge of the geometry, ab- 
sorptive characteristics, and sound source characteristics or offset time of the sound, is 
unavailable. Partially blind methods have been developed where the characteristics of 
the room are "learned" using neural network approaches ([3 ( 12, 21]), or where some 
form of segmentation procedure is used for detecting gaps in sounds so that the sound 
decay curve can be tracked ([8]). The only other method that may be described as truly 
blind is "blind dereverberation". where sound source recovery is attempted by decon- 
volving the room output with the unknown room impulse response. In principle, this 
method can be used for extracting the RT, but there are serious drawbacks that limit its 
applicability. Namely, the room impulse response must be minimum phase, a condition 
that is rarely satisfied ([10, 13]). 

Here we attempt to address several of the drawbacks found in existing methods by 
providing a blind approach that requires only one recording microphone. It does not 
require that a test signal be radiated into the test enclosure (as in Schroeder's method) 



BEST AVAILABLE COPY 



04/25/2003 12:11 KAA 217 265 5530 



OFC TECH MGT UIUC 



12014 



4 

or that the geometry and absorption characteristics of the test enclosure be known (as 
with the Sabine type formulae). The system performs blind estimation based on a de- 
cay curve model describing the reverberation characteristics of the enclosure. Sounds 
in the test enclosure (speech, music, or other pre-existing sounds) are continuously pro- 
cessed and a running estimate of the reverberation time is produced by the system using 
a maximum-likelihood parameter estimation procedure. A decision-making step then 
collects estimates of RT over a period of time and arrives at the most likely RT using an 
order-statistics filter. 

S. THEORY 

A model for blind estimation of reverberation time is presented, followed by an algo- 
rithm for implementation, and a decision-making strategy for selecting the estimate best 
representing the reverberation time of listening rooms. 

Before describing the model, we motivate the work with an example. The recorded 
response of a room to an impulsive sound source (a hand-clap) is shown in Fig. 1 A. As 
can be expected, there are strong early reflections followed by a decaying reverberant tail. 
If the early reflections are ignored, the decay rate of the tail can be estimated from the 
envelope. A common and widely used measure of the decay time is the jT$o time first 
defined by Sabine [17], which measures the time taken for a sound to drop 60 dB below 
the level at sound cessation. In practice, a decaying sound in a real environment reaches 
the ambient noise floor, thus limiting the dynamic range of the measured sound to values 
less than 60 dB, and so it is not possible to directly measure T ao , Instead, the time to reach 
-25 dB or -35 dB from a reference level of -5dB is often used 118], These values can be 
extrapolated to obtain r 60 - Figure 1C shows the measurement of Teo from the hand-clap 
data using Schroeder's method [18] described below Schroeder's method, while exact, 
suffers from the drawback that the precise instant of cessation of sound must be known, 
and there must be a sufficiently long period of silence to perform the estimation. Thus, 
It is not amenable to online implementation when sounds such as connected speech are 
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present. 

We begin with a model for the diffusive or reverberant tail of sounds In a room. This 
refers to the dense reflections that follow the early reflections. All that can be said about 
them is that they are the result of multiple reflections, and appear in random order, with 
successive reflections being damped to a greater degree if they occur later in time. Tra- 
ditionally, and dating back to Sabine, the decay envelope has been modeled as an expo- 
nential with a single time-constant Since the dense reflections are assumed to be uncor- 
related, a convenient and highly simplified model is to consider the reverberant tail to be 
an exponentially damped uncorrected noise sequence with gaussian characteristics. The 
model does not include the direct sound or early reflections. The goal is to estimate the 
time-constant of the envelope. 

A. Model of sound decay 

We assume that the reverberant tail of a decaying sound y is the product of a fine struc- 
ture x that is random process, and an envelope a that is deterministic. A central assump- 
tion is that x is a wide-band process subject to rapid fluctuations, whereas the variations 
in a are over much longer time-scales. Here, we will provide a statistical description of 
the reverberant tail with the goal of estimating the decay time-constant of the envelope. 

Let the fine structure of the reverberant tail be denoted by a random sequence x(n), n > 
0 of independent and identically random variables drawn from the normal distribution 
>f(0, <t). Further, for each n we define a deterministic constant a(n) > 0. The model for 
room decay then suggests that the observations y are specified by the sequence y(n) = 
<x(n)s(n). Due to the time varying term a(n) t the y(n) are independent but not identically 
distributed, and their probability density function is >f(0, aa{n)\ That is, the constant a(n) 
modulates the instantaneous power of the fine structure. For purposes of estimating the 
room decay time, we consider a finite sequence of observations, n = 0, . . . , N where N 
will be referred to as the estimation interval or estimation window length. For notational 
simplicity, denote the itf-dlmensional vectors of y and a by y and a, respectively. Then 
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Figure i: Temporal decay of a hand-clap at t = 04 s as recorded by a microphone (left column) 
and the model matching the reverberation (right column). (A) The recorded sound shows strong 
early reflections followed by a reverberant tail. Direct sound is excluded from the trace. (B) Model 
matching the reverberant tail shown in (A). Direct and early reflections are excluded The model 
is a Gaussian white noise process damped by a decaying exponential, parametrized by the noise 
power a and decay time-constant r. (C) Decay time-constant estimated from Schroeder's back- 
ward integration method 118] between -5 dB (0) and -25 dB (o). Slope of linear fit (dashed line) 
yields r = 59 ms (Tso 0.4 s). (D) Decay curve for model has identical slope everywhere follow- 
ing sound offset and captures the most significant part of decay (-5 dB to -25 dB). 

the likelihood function of y (the Joint probability density), parameterized by o and cr is 

L(y; a, a) = ^ \ (J—)W rrrxrf E^M n )M n )) 8 ^ 

where a and a are the (N + 1) unknown paramaters to be estimated from the observa- 
tion y. The likelihood function given by (1) is somewhat general, and while it is possible 
to develop a procedure for estimating all (JV + 1) parameters, suitable simplifications can 
be made when modeling sound decay in a room. Let a single time-constant r describe 
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the damping of sound envelope during free decay. Then the sequence a(n) is uniquely 
determined by 

a(n) = exp(-n/r). (2) 

Thus, the i\T-dimensional parameter a can be replaced by a scalar parameter a that is 
expressible in terms of r and a single parameter a = exp(-l/r), so that 

o(n)-a« (3) 

Introducing (3) into (1) yields 

- *> - 'g^' TO ^- E;i "'^ ii( '"' )- « 

For a fixed observation window iV, and a sequence of observations y(n), the likelihood 
function is parameterized solely by the time-constant a and the diffusive power cr. 

The model is shown in Fig. IB with parameters a and a matched to the experimental 
hand-clap data shown in Fig. 1 A- Note that the model does not include the early reflec- 
tions shown in panel A. The Schroeder decay curve for the model is shown in Fig. ID with 
a r 60 time of 0.4 s in agreement with the measured T 60 . The agreement between model 
and real Tw time motivates the search for an algorithm that can optimally estimate the 
two parameters. 

B. Maximum-likelihood estimation 

Given the likelihood function, the parameters a and <r can be estimated using a 
maximum-likelihood approach [15]. First, we take the logarithm of (4) to obtain the log- 
likelihood function 
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Then, we differentiate the log-likelihood function (5) with respect to a to obtain the 
score function [15] 



, x 01nL(y; a, <r) 
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The log-iikelihood function achieves an extremum when din L(y; a, a) /da = 0. That 
is, when 

n=0 

The zero of the score function provides a best estimate in the sense that E[s a ] = 0. 

Denote the zero of the score function s ai and satisfying (8), by a*. It can be shown 
that the second derivative o^lnLfo; a*, a) /da 2 < 0, i,e.. the estimate a* maximizes the 
log-likelihood function* 

The diffusive power of the reverberant tail or variance <P can be estimated in a similar 
manner. Differentiating the log-likelihood function (5) with respect to <r, we have 



, . dlnLly: a, a) 

V, a) g-i-i, (9) 

= -7 + ^E fl -N(n) 2 , (10) 

nmO 



which achieves an extremum when 
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As before, it can be shown that the E^] 0. Denote the zero of the score 
function 3*. and satisfying (11), by cr*. It can be shown that the second derivative 
d 2 In L(y; a, <t*)/do* < 0. i.e., the estimate a m maximizes the log-likelihood function. Note 
that the maximum-likelihood equation given by (8) is a transcendental equation and can- 
not be inverted to solve directly for a*, whereas the solution of (11) for cr* is direct This is 
considered in detail later when an algorithm for estimation is developed. 

Bounds on the estimate of a and a are obtained from the variance of the score function, 
also called the Fisher information J. This is more conveniendy expressed in terms of 
the derivatives of the score functions [151 . Given the parameter 0 T = [a cr] and the score 
function 6) = [s a (y] a, <j) ^(y; a, a)], we have 

j W _ El £*L% (12) 
From (7), (9). and (12). we have 



■ 



By the Cramer-Rao theorem [15], a lower bound on the variance of any unbiased esti- 
mator is simply J" 1 ^). which Is 

From the asymptotic properties of maximum-likelihood estimators [15], we know that 
the estimates of a and a are asymptotically unbiased and their variances achieve the 
Cramer-Rao lower bound (Le., they are efficient estimates). Thus, if a* and a* are the 
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estimates obtained from the solutions of (8) and (11), the variance of the estimates are 



W-^*NW=Ty < 15 > 

with equality being achieved in the limit of large N. Since the variance of a and a 
are 0(N~*) and 0(N- X ), the estimation error can be made arbitrarily small if observation 
windows are made sufficiently large. 



C. Algorithm for estimating decay time 

Given an estimation window length and the sequence of observations y(n) in the win- 
dow, the zero of the score function (8) provides an estimate of a. The function is a tran- 
scendental equation that must be solved numerically using an iterative procedure. Since 
the estimate of a can be obtained directly from (11), a two-step procedure was followed. 
First, an approximate solution for a* from (8) was obtained, and next, the value of a* was 
updated from (11). The procedure was repeated, providing succesively better approxi- 
mations to or and <r+, and so converging on the root of (8), 

Here we address the strategy for extracting the root in the smallest number of iterative 
steps. To gain an understanding of the root-solving procedure, we consider the example 
shown in Fig. 2. The function a = exp(-l/r) maps the room time-constant r one-to-one 
and onto a as shown in Fig. 2 A. For instance, consider a room time-constant of 0.1 s and 
a sampling rate of 16 kHz. Then the time-constant is 1600 samples, and so a ~ 0.9994 
(filled circle). The signficance of the number becomes clear if we consider that when the 
time-constant is 0.03 s, then a = 0.9979, whereas for r = co, a = 1. Hence the geometric 
ratio is highly compressive and values of a for real environments are likely to be close 
to 1. Thus, the advantage of estimating a rather than r is due to the bounded nature of 



Uli/ tl/UJ J.i.i£ fAA £1/ <Q9 0300 



um; j/jslm mux iuiic 



@02l 



11 




1 1 1 1 1 I J l i —1 

60 80 130 350 In* BO BO 130 250 Inf 
Time constant (ms) Time constant (ms) 



figure 2: maximum-likelihood estimation (MLE) of room decay time-constant (A) The time- 
constant of the exponential decay (r, abscissa) is mapped to a parameter a = cxp(- 1 /r) (ordinate) 
where r is given in sampling periods. The function is monotone but highly compressive and 
maps r G [0, oo) onto a 6 [0, 1). Filled circle shows r 100 ms (a = 0.9994). (B) Score function 
(derivative of log likelihood function) s a (a) (ordinate), decreases rapidly as a function of a (ab- 
scissa, marked in time constants using the map in (A)) . MLE of a is given by the root of *(a) (filled 
circle) . (C) The derivative s a t(a) as a function of a. At the root of s a (filled circle), the derivative is 
negative. Note the nearly 8-12 orders of magnitude change in s a and s*f for commonly encoun- 
tered values of r. (D) The ratio * a (a)/s a /(a) (ordinate) as a function of a is the incremental step 
size of the Newion-Raphson procedure for finding the root of (8). It provides an estimate of the 
convergence properties of the root-finding algorithm. Sampling frequency was 16 kHz, and the 
log-likelihood function was calculated assuming a 400 ms window 

a. The score function s a from (7) on the other hand, has a wide range (about 8 orders of 
magnitude, see Fig. 2B) and is zero at the room time-constant (filled circle). The gradient 
of the score function dsjda shown in Fig. 2C also demonstrates a wide range, but takes a 
negative value at the 2ero of s a . 

Thus, if we start with an initial value of ajj < a, the root-solving strategy must descend 
the gradient sufficiendy rapidly. The standard method for solving this kind of nonlinear 
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equation, where an explicit form for the gradient is available, is the Newton-Raphson 
method which offers second-order convergence [16], The order of convergence can be as- 
sessed from s a (d3 a /da)- 1 which is the incremental step size Aa in the iterative procedure 
(Fig. 2D). For example, with true value of r 100 ms. Aa at intermediate values in the 
iteration can be as small as 10~ 6 when a = 0.9993 (r = 90 ms) or a = 0.9995 (r - 120 ms). 
This corresponds to an incremental improvement of about 0.01 ms every iteration, thus 
providing slow convergence Jf the initial value is far from the zero. On the other hand, 
the bisection method [16] guarantees rapid gradient descent but works poorly in regions 
where the gradient changes relatively slowly (such as near the true value of a). Further- 
more, it guarantees only first-order convergence. 

However, the specific structure of the root-solving problem can be exploited since the 
behavior of s a is known. Here, both methods were used to obtain convergence to the root 
First the root was bisected until the zero was bracketed as close as possible, after which 
the Newton-Raphson method was applied to polish the root For the example shown, the 
root bracketing was accomplished in about 8 steps and the root polishing in 2-4 steps. In 
contrast with the same initial conditions, the Newton-Raphson method took about 500 
steps to converge. Taken together, the analysis presented here suggests that the estimation 
procedure is feasible and does not lead to significant errors even though values of a for 
real rooms are close to 1, and the score function and its derivative vary over many orders 
of magnitude. While other algorithms are possible, these are not dealt with here. 

D. Strategy for assigning the correct decay time from the estimates 

The theory presented in the preceding section provides one estimate of a and a in a 
given time frame of N samples. By advancing the frame as the signal evolves in time, 
a series of estimates a% are obtained, where k is the time frame. We assume that after a 
sufficient number of estimates have been obtained, a decision is made regarding the true 
value of the decay time-constant. In this section, we present a strategy for making such a 
decision. 
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Since we are considering blind estimation, the input is unknown, and so the model will 
fail when: 1) an estimate is obtained in a frame that is not ocurring during a free decay. 
This includes regions where there is sound onset or sound is ongoing. In these periods, 
the MLE scheme can provide widely fluctuating or implausible estimates due to model 
failure. 2) During a region of free decay initiated by a sound with a gradual rather than 
rapid offset In this case, the offset decay of the sound will be convolved with the room 
response, prolonging the sound even further and so, the estimated time-constant will be 
larger than the real room time-constant. Gradual offsets occur in many natural sounds, 
such as terminating vowels in speech. We address both issues here and provide a strategy 
for selecting the correct room time constant 

In the first case where the estimation frames do not fall within a region of free decay, 
many of the time frames will provide estimates of a close to unity (i.e„ infinite r), or 
Implausible values. On the other hand, the estimates will accurately track the true value 
when a free decay occurs. Intuitively, a strategy for selecting a from the sequence is 
guided by the following observation: the damping of sound in a room cannot occur at 
a rate faster than the free decay, and thus all estimates a* must attain the true value of a 
as a lower bound. The bound is achieved only when a sound terminates abruptly, upon 
which the model conditions will be satisfied, and the estimator will track the true value 
of the time-constant. 

While it seems intuitive to set a « min{a£}, it should be recognized that even during 
a free decay the estimate is inherently variable, and so selecting the minimum is likely to 
underestimate a. 

A robust strategy would be to select a threshold value of a* such that the left tail of the 
probability density function of a\ p(o*). occupies a pre-specified percentile value 7. This 
can be implemented using an order statistics filter specified by 



For a unimodal symmetric distribution with 7 = 0.5 the filter will track the peak (me- 
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dian) value. Order statistics filters play an important role in robust estimation, especially 
when data is contaminated with outliers (14], as is the case here. It should be noted that 
for 7 values approaching 0 the filter (17) performs like the minimum filter a = min{a;} 
suggested above. 

In the second case described above, where the sound offset is gradual, p(a*) is likely 
to be multimodal since sound offsets (such as terminating phonemes in speech) will have 
varying rates of decay, and their presence will give rise to a multiple peaks. The strategy 
then, is to select the first dominant peak in p(a*) when a* is increasing from zero (i.e., 
leftmost peak) . That is, 

a = min arg {dp(a 0 )/da m = 0}, (18) 

where the minimum is taken over all zeros of the equation. If the histogram is unl- 
modal but asymmetric, the filter tracks the mode and resembles the order-statistics filter. 

In conversational speech, where peaks cannot be clearly discriminated or the distribu- 
tion is multimodal, (17) can be employed by choosing a value of 7 based on the statistics 
of gap durations. For Instance, if gaps constitute approximately 10% of total duration, 
then 7 = 0.1 would be a reasonable choice. A judicious choice of 7 can result in the filter 
performing like an edge detector, since it captures the transition from larger to smaller 
values of the time-evolving sequence c£. 

The dedsion strategies, (17) and (18), were used to validate the model in simulated 
and real environments (see Results). 

m. EXPERIMENTAL METHODS 

In addition to simulations, the MLE approach was validated with real room data. The 
experimental methods and data analysis procedures are described in the following sec- 
tions. 
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A, Sound recordings 



To validate the MLE method, sounds recordings were made in several rooms, corri- 
dors and an auditorium, with the aim of determining their reverberation times. Sound 
stimuli that were used included 18-tap maximum length (ML) sequences (period length 
of 2 18 - 1), clicks (100 /i s), hand-claps, word utterances, and connected speech from the 
TIMTT corpus- Recordings were made using a single Sennhelser MK-II omni-directional 
microphone (frequency response 100-20000 Hz). Microphone cables (Sennhelser KA 100 
S-60) were connected to the XLR input of a portable PC-based sound recording device 
(Sound Devices USBPre 1.5). The recorder transmitted data sampled at 44.1 kHz to a lap- 
top computer (Compaq Presario 1700, running Microsoft Windows XP) via a USB link. 
Sound stimuli, stored as single-channel pre-sampled (44.1 kHz) WAV files, were played 
through the headphone output of the laptop, amplified by a power amplifier (ADCOM 
GFA-535II) and presented through a loudspeaker (Analog and Digital Systems Inc M ADS 
L200e, 85-20000 Hz). Data acquisition and test material playback were controlled by a 
custom written script in MATLAB (The MathWorks Inc.,) using the Sound PC Toolbox 
fTorstcn Marquardt). 

B. Measurement ofTeo time using Schroeder's method 

To validate the estimation procedure developed in this work, experimentally recorded 
data from real listening environments were processed using the ML procedure and com- 
pared to results obtained from a widely used method developed by Schroeder [18]. The 
method proposed by Schroeder determines the decay time-constant from the sound de- 
cay curve following the cessation of a broad- or narrow-band noise burst Briefly, if r(t) is 
the measured decay curve from a single trial, then the mean squared average of the decay 
curve s(t) over a large number of trials is related to r(t) by 



E[s 2 (t)} = J r 2 (x)dx> 



(19) 
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where the sound is assumed to switch off at t = 0, Schroeder's method, called the 
backward integration method, can be applied to a single broad-band channel or to multi- 
ple narrow-band channels. The recorded data were filtered offline in ISO one-third octave 
bands (21 bands with center frequencies ranging from 100-10000 Hz) using a fourth-order 
Type II Chebyshev band-pass filter with stopband ripple 20 dB down. The outputfirom 
each channel was processed by the ML procedure and Schroeder's method using (19). 
For the broad-band estimation, the microphone output was processed directly using the 
two methods. 

Due to the limited dynamic range of sounds in real environments, Schroeder's method 
requires the specification of a decay range. The decay ranges normally used are- from 
-5 dB to -25 dB (20 dB range), and from -5 dB to -35 dB (30 dB range). The decay 
curves In each range were fitted to a regression line using a nonlinear least squares fitting 
function (function nonlinsq provided by MATLAB). The fitted function was of the form 
A o%, where A is a constant, n is sample number within the decay window, and a* is the 
geometric ratio related to the decay time-constant by = exp(-l/r t i). This is in contrast 
to the model (2) which assumes an exponentially decaying envelope with time-constant 
r. whereas the decay curve Is obtained by squaring the signal. Hence, r d = r/2. Since 
decay curves were fitted to the -5 to -25 dB, and -5 to -35 dB drop-offs, two estimates 
of the time-constant were obtained. The fitted line was extrapolated to obtain the % time 
(in seconds) for each of the estimates using 

6 6 t 

Too = , ? ni — 7^ = . ~ ' = 13.82 r a . (20) 

The same procedure was followed for determining the time-constant from the broad- 
band signal. It should be noted that the ML procedure does not requirethe specification 
of a decay range, but only the specification of the estimation window length. So only one 
estimate per band is obtained. 
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C. Verification of MLE procedure with ideal stimuli 

Microphone data were processed using the MLE procedure to obtain a running esti- 
mate of the decay time-constant For model verification, estimation was performed on: 

1) the segment following the cessation of a maximum-length sequence or a hand-dap, and 

2) the entire run of a string of isolated word utterances. These were considered Jdealstim- 
uli since they fulfilled the model assumptions of free decay or long gaps between sounds. 
The estimates were binned for each run and a histogram was produced. The histogram 
was examined for peaks, and the time-constant was selected using the order-statistics fil- 
ter (18) if there were multiple peaks, or (17) if the histogram was unimodal- The estimate 
a so obtained was used to calculate the Teo time (in seconds) using the formula 

Tw = We-' 3 ) tof.w = bi^) " 69lT * ; (21) 

In theory, the Teo expressions given by (20) and (21) are identical due to the relationship 
between r and t*. However, the calculated values may differ, and this can be ascribed to 
either model inadequacies or discrepancies in measurement and analysis. 



D. Verification of MLE procedure for speech 

The performance of the MLE was also verified using connected speech played back in a 
real room. Test material were connected sentences from the Connected Speech Test (CST) 
corpus. Estimates from non-overlapping 1 s intervals were binned to yield a histogram, 
and the first dominant peak from the left of the histogram was selected to determine the 
room time-constant. The procedure for calculating Too time followed (21). 
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Figure 3; Illustration of procedure for continuous estimation of room decay time. A burst of 
white noise was applied at time t = 0.1 s (black bar, bottom trace, 100 ms duration). Simulated 
room output (bottom trace) shows the build-up and decay of sound in the room. A miming 
estimate of the parameter a in 200 ms windows is shown in the graph (ordinate, a shown in units 
of time-constant). Thie value of room decay time (100 ms) is shown as horizontal dashed line. 
The estimation window was advanced by one sample to obtain the trace, with each point at time 
t being the estimate in the window (t - 0.2, t]. During the build-up and ongoing phase bf the 
sound, estimated a sometimes exceeded 1 (i.e-, negative values of r). These were discarded and 
are not shown. As the window moved into the region of sound decay ft > 0.3 $), the estimates 
converged to the correct value. A histogram of the estimated time-constant is shown to the right 
of the trace. An order-statistics filter, such as the mode of the histogram, can be used to extract 
the room time-constant Sampling rate was 16 kHz. 

IV. RESULTS 

The estimation procedure was applied to a variety of data sets, Including simulated 
data and real room responses. To illustrate the methods and identify the strengths and de- 
ficiencies of the estimation procedure, wc first consider simulated data sets. Subsequently 
we will provide results for real data that validate the room time-constant estimates, and 
compare these to results from Schroeder's method. 
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A. Broad band white noise bursts in simulated rooms 

A 100 ms burst of broad-band white noise (8 kHz bandwidth) was radiated into a 
simulated room having a decay time-constant r = 100 ms (Fig. 3). Room output shown 
in the bottom trace of Fig. 3 shows the characteristic rise and decay of sound folbwing 
onset and offset of noise burst (horizontal bar). The graph shows the running estimate 
of decay time-constant obtained in a 200 ms time window by advancing every sample- 
Since time frames up to about t = 0.3 s are not regions of free decay, the estimator tended 
to produce values of a > 1. When this was observed in the root-bracketing step of the 
estimate, the root-solving procedure was aborted. Thus all estimates of a were bounded 
above by 1. It can be seen that when the window crosses into the region of free decay 
the estimator output stabilizes at the true value (horizontal dashed line). A histogram 
of the time-constant estimates (right axis) was input to the order statistics filter (17) with 
7 = 0.5. The reported time-constant from the filter was a = 101 ms. 

For comparison, the procedure was repeated with the simulated noise burst to jjnimic 
anechoic conditions. The histogram of a* demonstrated a strong peak at a = V (r = 
oo) (not shown). This showed that no consistent estimate of a was found. While this 
result acted as a control, histograms showing strong peaks at a = 1 are to be expected in 
anechoic or open spaces- 

B. Effect of filter length on estimation 

A parameter that is critical for estimation performance is the window length N isped- 
fied in (8). Small window lengths are expected to increase the variance of the estimate, as 
also indicated by the Cramer-Rao lower bound (15). This is shown in Fig. 4. A burst of 
white noise (100 ms duration) was convolved with a simulated room impulse response 
(r = 100 ms) and the estimator tracked the decay curve. The left column depicts the 
running estimate for four different window lengths: 0.5r, r. Zr. and 4r, top to bottom, 
respectively (dashed line: r). The right column depicts the corresponding histograms. As 
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60 100 300 Inf 

Tlmo constant (ms) 

Figure 4: Effect of estimation window length on the variance of the estimate. The simulation 
shown In Fig. 3 was repeated for windows of duration 0.5r, t, 2t< and 4r (top to bottom), Where 
r = 100 ms is the true value of the room time-constant. Left column shows the running estimate 
of parameter a (ordinate, shown as time-constant in ms) as a function of time (abscissaj. The 
right column shows the histogram of the estimates. The variance of the estimate decreases with 
increasing window length; however there is no bias introduced by the estimator as evident from 
the location of the peak in the histogram (arrowheads mark true value of r). 

window length increased, the MLE procedure gave improved estimates. We concluded 
that increasing window length reduced the variability In the estimates, and did not intro- 
duce significant bias. 



While it is desirable to have a large filter length, in practice this is limited by the; 



dura- 
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Oon and occurence of gaps between sound segments in the room output. Ideally theifilter 
length should be of the order of r or longer, but if the gaps are short, then increasing the 
Alter length beyond the mean gap will produce undesirable end effects where thd next 
sound segment creeps into the window. Thus, the window length should not be les$ than 
one-half or one-third of r, but the upper limit is dictated by the mean duration of gaps. 

C. Speech sounds in simulated room 

The examples considered above illustrated the performance of the algorithm when 
the input was broad band white noise. To be applicable in realistic conditions, the al- 
gorithm must perform in a variety of conditions and signal types, most notably inithose 
that include speech. Speech represents an example where the algorithm is expected to 
perform poorly, since it is nonstalionary and nongaussian. Further, the offset transients 
in speech sounds (even for plosives) have a decay time that can vary from 5-40 ms. ;Thus, 
estimation of decay times with speech presents a particular challenge to the algorithm. 

We took a sequence of 15 distinct and isolated words recorded in an anechoic enVlron- 

i 

ment at a sampling rate of 20 kHz. These included 1 1 consonant-vowel-consonant words 
(/p,b,g/V/d/, e.g,, "bed"), and 4 consonant-vowel words (/b/V/. e.g., "bay") separated 
by a mean interval of 200 ms. These were convolved with a simulated room impulse re- 
sponse having time-constant r = 100 ms. The task of the estimator was to track the decays 
for the entire duration of the sequence (approximately 11.4 s). The control conditioin was 
the clean input (i.e., anechoic). The results are shown In Fig. 5. Four different filter lengths 
were used as in Fig. 4. For the control condition (left column) no reliable estimates were 
produced for the smallest three windows (top three panels) since the histogram peaked 
at values of r approaching oo. For ihe simulated room response (right column), thi peak 
shifted towards the true value of r, with the best estimates being obtained for the largest 
window size of 4r (right column, bottom row). In all the histograms the peak was located 
at about 115 ms (arrow). This estimate deviated from the real time-constant of 100 ms 
due to the lack of sharp transients in the clean speech, A gradual sound offset tends to 
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Figure 5: Estimation of room time-constant from speech. Fifteen words recorded in aiji ane- 
choic (clean) environment (200 ma inter-word spacing) were convolved with a room model 
(r - 100 ms). Histograms of decay time-constants were estimated from clean (left column) and 
simulated reverberant responses (right column), and are shown for window durations 0;5r, r, 
2t, and 4r (top to bottom). The histogram for clean speech served as a control. Description fol- 
lows Fig. 4. Estimation from reverberant speech produces a clearly defined peak, especially for 
the longer window lengths, albeit with a small bias (right column, 2r and 4r). The bias can be 
attributed to the gradually decaying offsets inherent in speech so that the resultant decay is a con- 
volution of speech offset and the room response. For the control condition (left column), thej offset 
decay is visible only in the bottom two rows where the histogram exhibits a broad bump between 
50 and 100 ms. The fifteen words included 11 /p,b,g/V/d/ and 4 /b/V/ sampled at 20 kHz. 
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i 

prolong the reverberated sound even further. This can be seen In the "anechoic" control 
condition where a small peak is noticeable when window size is 4r (bottom panel, left 
column). The peak occurs around 60 ms, and corresponds to the gradual offsets of speech 
sounds. Thus, this introduces a bias in the estimates under reverberant conditions, j 

The results of the preceding sections demonstrate the importance of selecting a suit- 
able estimation window length. The choice of window length determines the variability 
of the estimates, and is critical to obtaining a histogram with a clearly resolved p<bak at 
the true value of the room time-constant Since the decision-making scheme uses an jorder 
statistics filter, which is a nonlinear operation, the robustness to the histogram variability 
are not known. Bimodal and multimodal histograms may be obtained if there is fluctuat- 
ing background noise or the sound segments have an intrinsic offset decay rate (as shown 

above). I 

j 



D. The effect on estimation of offset decay in speech 



The preceding section introduced the problem of estimating room decay time-cohstant 
when the input signal exhibited varying offset decays. Here we examine in greater detail 
the performance of the estimator with input comprising a single word (/b/V/, "boiigh"). 
The word was recorded under anechoic conditions and presented to the estimator? with- 
out modification so that the effect of the vowel offset could be determined. The ijesults 
are shown in Fig. 6. The terminating vowel has a gradually decaying offset (top ganel). 
Estimation of the offset decay was performed from t = 0.45 s (vertical dashed line}. Two 
procedures were employed. First, the envelope was extracted from the analytic signal, 

i 

windowed, and filtered to eliminate frequency components above 100 Hz. The envelope 
is shown in the middle panel (heavy outline). The envelope was then squared andjtrans- 
formed to a decibel scale, and the decay time-constant was estimated using a norilinear 
least square fit in windows of duration 0.4 s (horizontal bar). Estimates were obtained 
by sliding the window forward in steps of one sample. Note that the time at which an 
estimate is reported for any given window is the end point of the window. The estimate 
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Figure 6: Illustration of decay time estimation when a terminating phoneme is encountered The 
word "bough" recorded under anechoic (clean) conditions (top row) has a gradually decaying 
offset. The envelope was extracted by filtering the absolute value of the analytic signal (sjecond 
row, heavy outline), and its decay rate was estimated for the segment following the dashed line 
using two methods (bottom row). Overlapping segments (duration given by bar, with stejp size 
Indicated by the thickness of the vertical end) were converted to a decibel scale and thejdecay 
time-constant obtained by a least squares fit to a straight Hne (dotted trace). The same segments 
were analyzed using the MLE algorithm to obtain a second estimate of the decay time-constant 
(solid trace). While the estimators provide somewhat different results, they are in qualitative 
agreement Both methods suggest that the fastest decay time-constant is in the range of 50-j70 ms 
(see also Fig. 5). These results suggest that speech segments will introduce a bias when estimation 
is carried out in reverberant environments. j 

for the window indicated by the bar, for instance, is plotted at time t = 0.85 s. A cUrve of 
the estimated time-constants was thus obtained (dotted curve, bottom panel). The MLE 

i 

procedure was applied to the same segments and produced an independent estimate of 
the the decay time-constant (solid line, bottom panel). While the estimates differ jsome- 
what, they are in qualitative agreement. Both procedures indicate that the terminating 
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vowel had a time-dependent decay rate, and the greatest rate was between 50 and 7p ms. 

The results confirm the presence of the peak in Fig. 5 (left column, bottom panej), al- 
though the histogram shown in Fig. 5 was obtained for a sequence of 15 words.} The 
analysis shown in Fig. 6 also indicates the reason for estimation bias under reverberant 
conditions using speech samples. Taken together, the results suggest that the factors re- 
sponsible for estimation performance are: the presence of adequate numbers of jgaps, 
sharp offset transients, and estimation window length. j 

! 

i 
i 

E. Validation of method j 

i 
! 

The results demonstrate that estimation of decay time in room response is possible for a 

variety of sounds including impulses, noise bursts and speech. While we have show)i that 

i 

a reasonable agreement exists with a nonlinear least squares fit to the data (Fig. 6) . a! more 
careful evaluation Is necessary to determine conditions under which the MLE procedure 
is likely to provide accurate estimates. Here we establish that the estimated decayjtimes 
are comparable to decay times obtained from Schroeder's method [18]. Furthermore, any 
data collected must be under sufficiently realistic conditions where there Is background 
noise and where the testing sound is not subject to experimental control. A comparison of 
MLE performance with the standard method in real environments will therefore establish 
the utility of the method. j 

We compared the estimates using the backward integration method proposed by 

i 

Schroeder 118] in both single-channel (Le., the broad-band signal), and multi-channel 
frameworks (i.e., narrow-band signals occupying ISO one-third octave bands). I Since 
Schroeder's method requires a fitting procedure to estimate the time-constant a, decay 
range (either 20 or 30 dB below a reference level of -5 dB) was selected (see Secticln TH). 
The MLE procedure does not require the specification of such a range. We first report on 
the comparison between the methods using a hand-clap in a small office (8x3 m). Subse- 
quently we will summarize results obtained in a number of rooms of different sizei 

Figures 7A, B depict a hand-clap event and its spectrogram, respectively. The data in 
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panel A Is the same as shown in Fig. 1 A. except that Fig. 7A also includes the direct s<iund. 
The noise level in the room was 50 dB SPL, and peak sound pressure level resultingjfrom 
the hand-clap was 85 dB SPL The decay curve obtained using Schroeder's backward 
integration method is shown in Fig. 7C, normalized so that peak SPL was 0 dB. This is the 
broad-band curve obtained by integrating the recorded microphone signal. A straight- 
line fit to ihe 20 dB drop-off point (circle) from a reference level of -5 dB (lozenge) yielded 
r ^ 56 ms (Too = 0.39 s). The discrepancy between this value and that presented in Fig. 1 
(r = 59 ms) was due to the Jnclusion of the direct sound in Fig. 7. The windowsj over 
which the 20 dB drop-off was computed were not identical for the two cases, Thd data 
were run through the MLE procedure and a histogram of estimates was obtained and 
the decay time-constant was calculated from the peak of the histogram using (17)j This 
gave an estimate r = 53 ms (Tec = 0.37 s), which is in good agreement with the estimate 
obtained from Schroeder's method. Note that the estimates reported in this wojk are 
based on a single realization of sound events. The normal practise is to average over 
lat^e numbers of trials. Since our goal is to develop an online estimation procedure, we 

felt that it would be more realistic to use a single trial. ! 

i 

To test a range of room KTs, ISO one-third octave band analysis (exceeding 1 kHz Renter 
frequency) was performed in three environments. These were 1) the moderately reverber- 
ant room described above (Fig. 7), 2) a highly reverberant circular foyer, and 3) a iighly 
reverberant enclosed cafeteria. In all cases, the signal was a hand-clap generated \ m in 
front of the recording microphone, with peak energy exceeding 90 dB SPL. Output from 
the band-pass filters were analyzed using the MLE procedure, and the r value fof each 
band was obtained from the histogram by selecting the dominant peak. For Schrofeder's 
method, a 20 dB decay range was used. Figure 8 shows the T m estimates from Schrieders 
method (abscissa) versus the ML estimates (ordinate) for each ISO one-third bandj(open 
symbols), and the average over these bands (closed symbols). 

The figure shows that the variability of estimates for highly reverberant environments 
increases with increasing mean FT for both methods. However, the two methodsjare in 
good agreeement especially in the high-frequency bands (the single outlier falling below 
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Figure 7: Estimation of decay time-constant from real room data. (A) The room response to a 
hand-clap (same as Fig. 1A but includes the direct sound). (B) Spectrogram of the hand-clap 
demonstrates a sharp broadband onset transient and the decay as a function of frequency. (£) The 
decay time-constant was estimated using Schroeder's backward impulse integration method In 
the -5 dB (Jozenge) to -25 dfi (circle) range, followed by a least squares fit to a straight line to 
obtain the time-constant (r = 50 ms, Tfeo — 0.39 s). (D) Histogram of decay times obtained from 
signal shown in A using MLE. The median value of the histogram (arrow, t = 53 ms, T e o =j= 0.37) 
is in good agreement with the estimate obtained using Schroeder '$ method. | 

the diagonal in Fig. 8 is the lowest center frequency used in the analysis, namely 1 kHz). 
The agreement between the methods Is best when the Tco values are averaged over all 
bands (filled symbols), as is usually reported in the literature. 

A more extensive test to determine the variability in estimates across different envi- 

i 

ronments, and between bands, was performed in 12 environments, including small office 
rooms, an auditorium, large conference rooms, corridors, and building foyers. Thje data 

i 

were analyzed as in Fig. 8 and are shown in Fig. 9A. In comparison with Schrobder's 
method, the MLE procedure consistently overestimated T w in low to moderately (rever- 
berant environments {T& < 0.3 s) whereas it underestimated the reverberation tiijne for 
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Figure 8: Comparison of Schroeders method and the MLE procedure for T w times obtained 
in one-third octave bands. Three environments were tested: a moderately reverberant en rtron- 
merit (circles; the environment is the same as shown in Fig. 7), a highly reverberant circular foyer 
(squares), and a highly reverberant enclosed cafeteria (diamonds). In each environment, a single 
hand-clap was filtered using a bank of ISO one-third octave band-pass filters with center fire quen- 
cies exceeding 1 kHz. Ordinate shows the best estimates obtained from the MLE procedt re for 
each band. Abscissa shows the T«o times obtained from Schroeder's method Averages ofer all 
bands for each environment are shown as filled symbols. The diagonal dashed line is shown for 
reference, and points lying close to this line suggest good agreement between the two prow lures. 
Agreement is best when the T$o values are averaged over all the bands. 

more reverberant environments (T^o > 1.3 s). There was good agreement between the 
two methods for intermediate ranges. The average T M over all bands (filled squares) 
were however, in good agreement. Broad-band estimates were made using the sam a pro- 
cedures but without band-pass filtering of recorded signals. These are shown in Fig. 9B. 
The trend in the estimates was similar to that observed with narrow-band signals, ex- 
cept for one outlier. The outlier along with three other data points were obtained in a 
large auditorium. The latter three were obtained with a source-to-microphone distance of 
1.5 m, whereas for the outlier the distance was 4 m. The sound levels were not a^usted 
to compensate for the distance, and hence the test corresponding to the outlier toas at 
a lower SPL, resulting in reduced dynamic range (peak to noise floor). For thessj tests. 
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Figure 9: Reverberation time estimates from real environments. Seventeen tests in 12 environ- 
ments were conducted using noise bursts. Decay time-constants were estimated using thcMLEal- 
gorithm (ordinate) and the extrapolated T«, times were compared with estimates from Schro sder 's 
method (abscissa). (A) Estimates of T w in one-third octave bands with center frequencies e cceed- 
ing 1 kHz (open circle) and their average (filled square). (B) Broad band estimates of Tbo fa im the 
recorded room response. Averaged narrow band estimates are more accurate than hroac band 
estimates, presumably due to the presence of low frequency components in the latter, Furth ar, the 
MLE method over-estimates T 60 (in comparison with Schroeder 's method) when room revfbera- 
tion Is moderate (< 0.3 s) whereas for higher values of (> 1.3 s) there is reasonable agreement 
The single outlier is due to inaccuracies in the Schroeder estimate (see text for discussion). Results 
are from one presentation of noise burst in each test 

i 

the Schroeder estimates of T w (in seconds) were 2,18 (outlier), and 0.39, 0.39, and 0.33, 
respectively. The ML estimates, on the other hand, were 0 69 (oudter). 0.77, 0.8. and 0.67, 
respectively. Schroeders method was inaccurate due to the reduced dynamic range. On 
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the other hand, the ML estimates, while larger than the Schroeder estimates, were consis- 
tent and relatively robust to the reduction in dynamic range. 

These results raise the issue of estimation in narrow bands. It appears, althougl 1 it is 
by no means conclusive, that the upper one-third octave bands (over 1 kHz) may provide 
more accurate estimates than the lower bands. Since frequency decomposition is a stan- 
dard part of most audio signal processing algorithms, such as found in beamform ng, it 
may be useful to track estimates in the higher frequency bands, or in select bands \ /here 
the energy is greatest Tracking high-energy bands is likely to provide more temporal 
range in tracking decays before encountering the noise floor, and thus sharpen the peak 
in the histogram of estimates. Alternatively, averaging over all high-frequency banc s can 
provide estimates that are in close agreement with T M times obtained from Schro der's 
method. 

The findings suggest that there is good correlation between the estimates obtained 
from the MLE procedure and those obtained from Schroeder s method. While it Is not 
possible to determine which method provides greater accuracy, we suggest that tie val- 
ues arc correlated and in general agreement 

F. Estimation of ST from connected speech in real listening environments 

The results presented in the preceding sections indicate that the ML estimator c utput 
is in good agreement with actual or simulated room KB. In particular, the estimator 
can be applied to isolated word utterances, even though the naturally decaying officts of 
terminating phonemes leads to an over-estimation of RT (see Fig. 6). Here, we test tl e per- 
formance of the procedure explicitly in a challenging estimation task, namely estimating 
room RT from connected speech. 

A segment of speech (about 50 s in duration) from the Connected Speech Test (CST) 
corpus was played back in a partially open, circular foyer (one-third octave band analysis 
shown in Fig, 8, square symbols). The FT for this environment was first estimated with 
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Figure 10: Evaluation of room reverberation time (RT) from connected speech played b ick in 
a partially open circular foyer. The RT for this environment as measured from hand-claj $ was 
1.66 ± 0.07 s (Schroeder's method) and L62 s (from ML procedure). (A) Trace of CST pissage 
(duration 50 s) recorded in the environment Bar indicates 1 s. (B) The histogram of ML estimates 
over the duration of recording. The first peak in the aggregate histogram is the best RT es imate 
from connected speech (1.83 s). The horizontal bar is the range of RT estimates obtainec torn 
Schroeder's method, and the triangle indicates the ML estimate. (C) Peak values from hist >gram 
of estimates were obtained every 1 s. and the 50 peak values were used to produce the histogram 
shown. The best estimate of RT from this histogram is at the dominant peak (1.7 s), which is closer 
to the estimates obtained from pure decay curves. Thus, using short-term histograms as in (C) is 
more reliable than the long-term histogram shown in B. Overall, the results indicate that t >e ML 
estimator produces reliable estimates with connected speech. 

hand-claps using Schroeder's method (1.66 ± 0.07 s) and independently confirmed with 
the ML procedure (estimated RT from histgoram was 1.62 s). The ML procedure wa s then 
applied to the recorded speech data (Fig- 10A). A histogram of room time constants for 
the duration of the recorded data was constructed (Fig. 10B). The order-statistics filter 
was used to select the first dominant peak in the histogram (RT = 1.83 s). This is tf e best 
RT estimate based on the aggregate data. It is possible to refine the procedure for at riving 
at the best estimate by applying the order-statistics filter at much shorter time intervals. 
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Towards this end, a histogram was constructed at intervals of 1 s, and the best RT esti mate 
for this interval was obtained. The resulting best estimated from all one-second durations 
(50 in all) were binned to produce the histogram shown in Fig. IOC. It can be seer that 
the vast majority of the estimates were at FT = 1.7 s, which agrees with the mean ^ralue 
of 1.66 s from Schroeder's method (using hand-daps), and is well within its stardard 
deviation (0.07 s; the one-sigma interval is indicated by the horizontal bar in Fig. 10 J.C). 

Given that terminal phonemes have a natural decay rate (see Fig. 6), it is noi sur- 
prising that the ML procedure produces estimates somewhat larger than the real room 
RT, Further, the discrepancy between the actual RT and those estimated from connected 
speech arise from the absence of adequate numbers, and the limited duration, of gap s (see 
Fig. 10A). Thus, regions of free decay where estimation is accurate are limited. Not with- 
standing these constraints, the procedure works well in part due to the decision-making 
capability buUt into the order-statistics filter By selecting the first dominant peak (from 
the left) in the histogram, the filter in effect rejects spurious estimates and so reduces 
the error in the estimation procedure. The mean value of the histogram or its median, 
for instance, would result in significantly higher estimates of RT. The performance of the 
order-statistics filter can be further improved if one were to obtain a statistical chai acter- 
ization of gap duration from a large corpus of connected speech or other sounds- Such a 
characterization could provide a robust percentile cut-off value (see Eq. 17) which could 
then be used to select the best RT value for the room. 

In conclusion, the ML procedure in combination with the order-statistics filte: ; pro- 
vides a robust means for blind estimation of room RT. The procedure has been val dated 
against Schroeder's method, and with real room data such as hand-claps, isolated] word 
utterances, and connected speech. 



V. DISCUSSION 



A method for the blind estimation of reverberation time was presented. The method 
models the revereberant part of the room response, i.e., the diffusive tail assumir g that 
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the response Is an exponentially damped gaussian noise sequence. The input wss as- 
sumed to be an ideal impulse. The objective was to determine the decay time-consts nt of 
the exponential. Using the statistical description of the process, a maximum-likelihoc d es- 
timator (MLE) was derived, and solved numerically to arrive at the decay time-con rtant 
This was then extrapolated to obtain the reverberation (T^) time. 

The estimation of reveberation time is a widely investigated problem. TVaditionally, 
two approaches have been taken. The RT is computed analytically using formulae that 
incorporate the geometry and absorptive characteristics of the reflecting surfaces, or a test 
sound with known properties is radiated into the environment, and the RT is estir rated 
from the received sounds. The former approach is embodied in the Sabine type fori lulae 
([4 f 9, 17, 20]; see [7, 23] for reviews), while the latter is based on Schroeder's decay rurve 
analysis ([2, 18, 22]). In both approaches, prior knowledge of the environment cr test 
sound is required, For example, in Sabine type formulae, the volume, surface area and 
absorpUon coefflecients must be known; and in Schroeder's decay curve analysis, the 
test sound must be uncorrected noise that is abruptly switched off at a known time, and 
followed by a sufficiently lengthy pause to track the decay. Thus, the methods a-e not 
suited for RT estimation from passively received sounds tn unknown environments (i.e., 
blind estimation). 

The MLE procedure removes these difficulties since it is based on a widely accepted 
model of the reverberant tail, namely the exponential decay model (see [23] for a discus- 
sion on how the Sabine type formulae are related to a linear decay of the sound pn assure 
level after the source is turned off). Here, it is assumed that the amplitude of succ issive 
reflections are damped exponentially, while the fine structure is a random uncon: dated 
process. Since the model is a good approximation of reverberation in most diffusive en- 
vironments, the method presented here provides a framework for blind estimation with 
wide applicability Hie success of the approach also derives from the analytically tra ctable 
nature of the maximum-likelihood formulation, reducing the problem to the estir lation 
of a single parameter that can be determined computationally We also showed that for 
ongoing and onset segments of the sound, the estimates will assume implausible values 
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since the model is not valid in these regions. However, an order-statistics filter diwn 
stream to the ML estimation can reject these estimates and extract the room KT 
proved confidence. This is based on the intuitive idea that sounds cannot decay 
than the rate prescribed by the room time-constant, and thus, selecting the 
improves the confidence of the estimates. To our knowledge, this approach has not] been 
reported in the literature. 

The two encouraging results of this study are the results obtained u: 
and the validation of the estimates using Schroeder's method. Speech sounds present par- 
ticular problems to most estimation algorithms since they violate the two most com* lonly 
held assumptions, namely stationarity and Gaussian statistics. Further, even abruptly 
minating phonemes such as stop consonants demonstrate a decay at the cessation 
sound. Such decays may be in the range of 5-40 ms and can increase the overall 
time estimated in reverberant environments. However, except for the increase 
mated decay time (a variation up to about 15% for sounds terminating in /d/) the 
tag and histogram procedure works rather well, indicating that the method is rela|th/ely 
robust to model uncertainties. 

Partially blind approaches to KT estimation are also available. 1) A neural network 
be trained to learn the characteristics of room reverberation ([3. 12]). Here, it is 
to train the network whenever the environment changes. 2) The signal is explicitly 
mented to identify gaps wherein decays can be tracked ((8]). It should be noted, 
order-statistics filter developed in this work performs an implicit segmentation 
nal by rejecting estimates that are implausible. 3) A blind dereverberation procedure 
be used to obtain the room Impulse response. However, the room impulse response 
be minimum phase, a condition that most listening environments fail to satisfy 

The ML procedure presented here is just one method for estimating room RT. 
methods are also possible. For Instance the envelope of the sound can be 
the estimation interval, converted to sound pressure level, and a regression line 
be fitted to obtain the T«j time. This is a blind version of the RT estimation 
followed by Lebart et al. ([8J), The order-statistics filter can be applied to the histogram 
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estimates as with the ML procedure. The method is non-parametric and so is not si bject 
to model uncertainties. This approach was used to estimate the decay rate of isolated 
word utterances (Fig. 6). Further work is needed to compare this procedure to th& ML 
procedure. 

The ML procedure is model-based and is expected to perform reasonably weU ih dif- 
fuse sound fields (i.e., uniform with respect to directional distribution) and where i sin- 
gle time-constant describes the reverberant tail. For most sound fields this is a resonable 
approximation (see [7] for a discussion on this point). The estimates of are in reason- 
able agreement with Schroeder's method in most of the listening environments tested, 
including challenging situations where the source or recording microphone was cl>se to 
a wall, or there was moderate background noise (see Fig. 7). Further, it provided con- 
sistent estimates even when the dynamic range of sound decay was reduced, in co itrast 
to Schroeder's method which provides inaccurate estimates under these conditions (see 
Fig. 9 and accompanying text). While the ML procedure produces best results when there 
are isolated impulsive sounds or abruptly terminating white noise bursts, the results of 
tests with isolated word utterances and connected speech are in good agreement with the 
actual 2V Thus, the procedure is expected to work under most listening conditions. 

The method proposed here will perform poorly when there are room resonances and 
the sound pressure level decays nonlinearly with time. This can be a result of the room 
geometry, or positioning the recording microphone in a region of the sound field that 
is nondiffusive (e.g., against a reflecting surface). In addition to model failure, thfe per- 
formance of the estimator may be poor when there are insufficient numbers of gsps, or 
high levels of background noise. Good performance results when there are abot 1 10% 
gaps and the peak sound level (at the time of offset) is about 25 dB SPL over the noise 
floor. Performance may also be poor when background noise is modulated (such a s with 
background music or babble), since the procedure will attempt to track any modulation 
present in the environment and hence produce multi-modal histograms with pealjs that 
may not be easily discriminated. 

The blind estimation procedure suggested here can be applied in a number oflsitua- 
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tions. Since only passive sounds are used f any audio processor that has access to micro- 
phone input can estimate the room reverberation time, either in single-channel (broad- 
band) or multi-channel (narrow-band) mode. One of the most interesting applications is 
in the selection of signal processing strategies tailored to specific listening environments. 
These include hearing aids and hands-free telephony. Most modern hearing aids have 
the ability to switch between several processing schemes depending on the Ustenir g en- 
vironment. For instance, in highly diffusive environments, where the source to li tener 
distance exceeds the critical distance, adaptive beamformers are ineffective. In thfc situ- 
ation, it would be convenient to switch off the adaptive algorithm and revert to s mple 
delay-and-sum beamforming. Such decisions can be made if there is a passive method 
for determining room reverberation characteristics. Other potential applications could 
include hands-free telephony; and sound level meters, A limitation of the methoc is its 
relatively poor performance with narrow-band signals whose center frequencies are be- 
low 1 kHz, However, the performance is good for broad-band signals, and narrowband 
signals whose center frequncies exceed 1 kHz. 

The computational costs of implementing the procedure are largely due to the : 
tive solution of the maximum-likelihood equation. We have developed fast algorithms 
for reducing the computational cost so that the procedure can be implemented in real- 
time (forthcoming publication). Thus, the method can be implemeted in passive listening 
devices to determine the reverberation characteristics of the environment 



[11 Bolt RK, MacDonald AD (1949). Theory of speech masking by reverberation. J. Acouk Soc 
Amer. 21: 577-580 

[21 Chu WT (1978). Comparison of reverberation measurements using Schroeder's iijipube 
method and decay curve averaging method. J. Acoust Soc. Amer. 63(4): 1444-1450 

[31 Cox TJ, Ll F, Darlington P (2001). Extracting room reverberation time from speech usir|g arti- 
ficial neural networks. J, Audio Eng. Soc. 49(4): 219-230 

[4] Eyring CF (1930). Reverberation time in "dead" rooms. J. AcousL Soc. Amer. 1 (2): 217124! 



itera- 



0 4/25/2 003 12:1 6 FAX 217 265 5530 



OFC TECH MGT UIUC 



®047 



n en- 



(5] Knudsen VO (1929). The hearing of speech In auditoriums. J. Acoust Soc Amer. 1: 56-^2 
[6] Kuttruff H (1995) . A simple iteration scheme for the computation of decay constants i 

closures with diffusely reflecting boundaries. 
[7] Kuttruff H (1991). Room Acoustics. London: Elsevier Science Publishers Ltd, Third Ecjition. 

J. Acoust Soc. Amer. 98(1): 288-293 
(8] Lebart K. Boucher JM, Denbigh PN (2001). A new method based on spectral subtraction for 

speech dereverberation. Acustics 87: 359-366 
[9] Millington G (1932). A modified formula for reverberation. J. Acoust Soc Amer. 4(1): 69-82 
[10] MiyoshiM, KanedaY (1988). Inverse filtering of room impulse response. EEEE Trans. Apoust 

Speech and Sig. Proc. 36(2): 145-152 
(1 11 Nabalek AK, Letowski TR. Tucker FM (1989). Reverberant overlap- and self-masking ih con- 
sonant identification. J. Acoust Soc. Amer. 86(4); 1259-1265 
[12J Nannariello J, Fricke F (1999). The prediction of reverberation time using neural ne^owrk 

analysis. Appl. Acoust 58: 305-325 
[13] Neely ST, Allen JB (1979). Invertibility of room impulse response. J. Acoust Soc. 
66: 165-169 

[14] Pitas I, Venetsanopoulos AN (1992). Order statistics in digital impage processing. Procj 1 
80(12): 1893-1921 

[15] Poor V (1994). An Introduction to Signal Detection and Estimation. New York: Springer-Aferlag 
[16J Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992). Numerical Recipes in C « 

bridge: Cambridge Univ. Press 
117] Sabine WC. Collected Papers on Acoustics. Cambridge: Harvard University Press. 
118] Schroeder MR (1965). New method for measuring reverberation time. J. Acoust Soc. 

37: 409-412 

(19] Schroeder MR (1966). Complementarity of sound buildup and decay J. Acoust Soc 
40(3): 549-551 

[20) Sette WJ (1933). A new reverberation time formula. J. Acoust Soc. Amer. 4(3): 193-210 
121] Tahara Y, Miyajima T (1998). A new approach to optimum reverberation time characteristics. 



37 



\mer. 



IEEE 



Cam- 



Ajner. 



Amer. 



04/25/2003 12:16 FAX 217 265 5530 OFC TECH MGT UIUC 



38 



Appl. Acousl 54: 113-129 
[22] Xiang N (1995). Evaluation of reverberation times using a nonlinear regression approach. J. 

Acousl Soc Amer. 98(4): 2112-2121 
[23] Young RW (1959). Sabine reverberation equation and sound power calculations. J. A<}oust 

Soc. Amer. 31(f): 912-921 



This Page is Inserted by IFW Indexing and Scanning 
Operations and is not part of the Official Record 

BEST AVAILABLE IMAGES 

Defective images within this document are accurate representations of the original 
documents submitted by the applicant. 

Defects in the images include but are not limited to the items checked: 

BLACK BORDERS 

□ IMAGE CUT OFF AT TOP, BOTTOM OR SIDES 

□ FADED TEXT OR DRAWING 

□ BLURRED OR ILLEGIBLE TEXT OR DRAWING 

□ SKEWED/SLANTED IMAGES 

□ COLOR OR BLACK AND WHITE PHOTOGRAPHS 

□ GRAY SCALE DOCUMENTS 
LINES OR MARKS ON ORIGINAL DOCUMENT 

□ REFERENCE(S) OR EXHIBIT(S) SUBMITTED ARE POOR QUALITY 

□ OTHER: 

IMAGES ARE BEST AVAILABLE COPY. 
As rescanning these documents will not correct the image 
problems checked, please do not report these problems to 
the IFW Image Problem Mailbox. 



